19. Quiz - K-means

We might want to look at the distribution of the Title+Body length feature we used before and, instead of using the raw number of words, create categories based on this length: short, longer, …, super long.

In the questions below, I'll refer to the length of the combined Title and Body fields as Description Length (by length I mean the number of words when the text is tokenized with pattern="\W").
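To make the definition concrete, here is a minimal plain-Python sketch of the word count, not the Spark pipeline itself. It assumes that Title and Body are joined with a single space, that the text is lowercased, and that empty tokens between adjacent separators are dropped (which is how Spark's RegexTokenizer behaves with its default settings, as far as I know). The function name description_length is made up for this illustration.

```python
import re


def description_length(title: str, body: str) -> int:
    """Count word tokens in the combined Title and Body text."""
    # Assumption: the two fields are joined with one space and lowercased.
    text = f"{title} {body}".lower()
    # Split on every non-word character; adjacent separators produce
    # empty strings, which are filtered out here.
    tokens = [tok for tok in re.split(r"\W", text) if tok]
    return len(tokens)


if __name__ == "__main__":
    # Toy example, not taken from the course data set.
    print(description_length("How do I sort a list?",
                             "I tried list.sort() in Python."))  # 12
```

Note that punctuation such as "list.sort()" splits into two tokens, so this count can differ from a simple whitespace split.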

How many times greater is the Description Length of the longest question than the Description Length of the shortest question (rounded to the nearest whole number)?

Tip: Don't forget to import Spark SQL's aggregate functions that can operate on DataFrame columns.

SOLUTION: 753
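The ratio in the question above is just the maximum length divided by the minimum length, rounded. A small plain-Python sketch on made-up numbers (in Spark you would presumably compute the two values with the min and max aggregates from pyspark.sql.functions instead); length_ratio is a hypothetical helper name:

```python
def length_ratio(lengths):
    """Return how many times longer the longest item is than the shortest."""
    # Assumes every length is at least 1; a zero length would raise
    # ZeroDivisionError here.
    return round(max(lengths) / min(lengths))


if __name__ == "__main__":
    toy_lengths = [12, 48, 150, 903, 7]  # hypothetical Description Lengths
    print(length_ratio(toy_lengths))  # 903 / 7 = 129
```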

What are the mean and standard deviation of the Description Length?

SOLUTION: 180, 192
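For reference, a plain-Python sketch of the same two statistics on made-up numbers. It uses the sample standard deviation (dividing by n - 1), which, as far as I know, is what Spark SQL's stddev aggregate computes as well; the helper name mean_and_std is hypothetical.

```python
import statistics


def mean_and_std(lengths):
    """Return (mean, sample standard deviation) of a list of lengths."""
    # statistics.stdev divides by n - 1 (sample standard deviation);
    # statistics.pstdev would divide by n instead.
    return statistics.mean(lengths), statistics.stdev(lengths)


if __name__ == "__main__":
    toy_lengths = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
    mean, std = mean_and_std(toy_lengths)
    print(round(mean), round(std, 2))
```

On a large data set the difference between the sample and population versions is negligible, but it can matter when checking results on small test inputs.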

Let's use K-means to create 5 clusters of Description Lengths. Set the random seed to 42 and fit a 5-cluster K-means model on the Description Length column (you can use KMeans().setParams(…)).
What length is the center of the cluster representing the longest questions?

SOLUTION: 2634
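To show what the clustering step does on a single numeric column, here is a minimal sketch of Lloyd's algorithm in one dimension, written in plain Python. It is not Spark's KMeans: the initialization differs, so seed 42 here will not reproduce the Spark result, and the function name kmeans_1d and the toy data are made up for illustration.

```python
import random


def kmeans_1d(values, k, seed=42, max_iter=100):
    """Cluster numbers into k groups; return the sorted cluster centers."""
    rng = random.Random(seed)
    # Start from k distinct observed values (needs at least k distinct values).
    centers = rng.sample(sorted(set(values)), k)
    for _ in range(max_iter):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster;
        # an empty cluster keeps its previous center.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments stopped changing
            break
        centers = new_centers
    return sorted(centers)


if __name__ == "__main__":
    # Hypothetical Description Lengths forming five rough groups.
    toy_lengths = [20, 25, 30, 110, 120, 130, 400, 420, 900, 950, 2500, 2700]
    centers = kmeans_1d(toy_lengths, k=5)
    # The largest center represents the cluster of longest questions.
    print(max(centers))
```

The center of the "longest questions" cluster is simply the largest of the fitted centers, which is the quantity the last question asks for.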